Abstract

We describe a new sequential learning scheme called “stacked sequential
learning”. Stacked sequential learning is a meta-learning algorithm, in
which an arbitrary base learner is augmented so as to make it aware of the
labels of nearby examples. We evaluate the method on several “sequential
partitioning problems”, which are characterized by long runs of identical
labels. We demonstrate that on these problems, sequential stacking
consistently improves the performance of non-sequential base learners;
that sequential stacking often improves performance of learners (such as
CRFs) that are designed specifically for sequential tasks; and that a
sequentially stacked maximum-entropy learner generally outperforms CRFs.


1  Introduction 

In this paper, we will consider the application of sequential
probabilistic learners to sequential partitioning tasks. Sequential
partitioning tasks are sequential classification tasks characterized by
long runs of identical labels: examples of these tasks include document
analysis, video segmentation, and gene finding.

Motivated by some anomalous behavior observed for one sequential learning
method on a particular partitioning task, we will derive a new learning
scheme called stacked sequential learning. Like boosting, stacked
sequential learning is a meta-learning method, in which an arbitrary base
learner is augmented (in this case, by making the learner aware of the
labels of nearby examples). Sequential stacking is simple to implement,
can be applied to virtually any base learner, and imposes only a constant
overhead in training time: in our implementation, the sequentially stacked
version of the base learner A trains about seven times more slowly than A.

In experiments on several partitioning tasks, sequential stacking
consistently improves the performance of non-sequential base learners.
More surprisingly, sequential stacking also often improves performance of
learners specifically designed for sequential tasks, such as conditional
random fields and discriminatively trained HMMs. Finally, on our set of
benchmark problems, a sequentially stacked maximum-entropy learner
generally outperforms conditional random fields.

2  Motivation 

2.1  A  Task  for  Which  MEMMs  Fail 

To motivate the novel learning method that we will describe below, we
will first analyze the behavior of one well-known sequential learner on a
particular real-world problem. In a recent paper [2], we evaluated a
number of sequential learning methods on the problem of recognizing the
“signature” section of an email message. Each line of an email message was
represented with a set of hand-crafted features, such as “line contains a
possible phone number”, “line is blank”, etc. Each email message was
represented as a vector $\mathbf{x}$ of feature-vectors $x_1, \ldots, x_n$,
where $x_i$ is the feature-vector representation of the $i$-th line of the
message. A line was labeled as positive if it was part of a signature
section, and negative otherwise. The labels for a message were represented
as another vector $\mathbf{y}$, where $y_i$ is the label for line $i$.

The dataset contains about 33,013 labeled lines from 617 email messages.
About 10% of the lines are labeled “positive”. Signature sections always
fall at the end of a message, usually in the last 10 lines. In the
experiments below, the data was split into a training set (of 438
sequences/emails), and a test set with the remaining sequences, and we
used the “basic” feature set from Carvalho & Cohen.


Report Documentation Page (Standard Form 298, Rev. 8-98)

Report date: JUL 2005
Dates covered: 00-00-2005 to 00-00-2005
Title and subtitle: Stacked Sequential Learning
Performing organization: Carnegie Mellon University, Center for Automated
Learning & Discovery, 5000 Forbes Ave, Pittsburgh, PA, 15213
Distribution/availability statement: Approved for public release;
distribution unlimited
Security classification (report, abstract, this page): unclassified
Limitation of abstract: Same as Report (SAR)
Number of pages: 8

The complete dataset is represented as a set $S$ of examples
$S = \{(\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_t, \mathbf{y}_t), \ldots, (\mathbf{x}_m, \mathbf{y}_m)\}$.
Sequential learning is the problem of learning, from such a dataset, a
sequential classifier, i.e., a function $f$ such that $f(\mathbf{x})$
produces a vector of class labels $\mathbf{y}$. Clearly, any ordinary
non-sequential learning algorithm can be used for sequential learning, by
ignoring the sequential nature of the data.^1

In the previous paper [2], we reported results for several non-sequential
and sequential learners on the signature-detection problem, including a
non-sequential maximum entropy learner [1] (henceforth ME) and conditional
random fields [8] (henceforth CRFs). Another plausible sequential learning
method to apply to this task is the maximum-entropy Markov model (MEMM)
[9], also called maximum-entropy taggers [11], conditional Markov models
[7], and recurrent sliding windows [4]. In this model, the conditional
probability of a label sequence $\mathbf{y}$ given an instance sequence
$\mathbf{x}$ is defined to be

$$\Pr(\mathbf{y} \mid \mathbf{x}) = \prod_i \Pr(y_i \mid y_{i-1}, x_i) \qquad (1)$$

The local model $\Pr(y_i \mid y_{i-1}, x_i)$ is learned as follows. First
one constructs an extended dataset, which is a collection of
non-sequential examples of the form $((x_i, y_{i-1}), y_i)$, where
$(x_i, y_{i-1})$ denotes an instance in which the original feature vector
for $x_i$ is augmented by adding a feature for $y_{i-1}$. We will call
$(x_i, y_{i-1})$ an extended instance, and call $y_{i-1}$ a history
feature. Note that $y_i$ is the class label for the extended example
$((x_i, y_{i-1}), y_i)$.

After constructing extended instances, one trains a maximum-entropy
conditional model from the extended dataset. Inference is done by using a
Viterbi search to find the best label sequence $\mathbf{y}$ according to
Equation 1.
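The construction of the extended dataset can be sketched as follows. This is a minimal illustration, not the Minorthird implementation: the dictionary-based feature representation, the feature name `prev_label`, and the `START` placeholder used for the first line of a message are all assumptions made for the example.

```python
def extend_with_history(x_seq, y_seq, start_label="START"):
    """Build MEMM-style extended examples ((x_i, y_{i-1}), y_i) for one sequence.

    x_seq is a list of feature dicts, y_seq the list of true labels.
    start_label is a hypothetical placeholder for the label "before" line 1.
    """
    examples = []
    prev = start_label
    for x_i, y_i in zip(x_seq, y_seq):
        extended = dict(x_i)              # copy, so the original features stay intact
        extended["prev_label"] = prev     # the history feature y_{i-1}
        examples.append((extended, y_i))  # y_i remains the class label
        prev = y_i                        # training uses the TRUE previous label
    return examples


# Toy message of three lines; the last two belong to the signature.
x_seq = [{"blank": 0, "phone": 0}, {"blank": 1, "phone": 0}, {"blank": 0, "phone": 1}]
y_seq = ["-", "+", "+"]
extended = extend_with_history(x_seq, y_seq)
```

Note that the history feature is always the true previous label at training time; Section 3 argues that this is the source of the train/test mismatch.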

MEMMs have a number of nice properties. Relative to the more recently
proposed CRF model, MEMMs are easy to implement, and (since no inference
is done at learning time) relatively quick to train. MEMMs can also be
easily generalized by replacing the local model with one that uses a
longer “history” of $k$ previous labels, i.e., a model of the form
$\Pr(y_i \mid y_{i-1}, \ldots, y_{i-k}, x_i)$, and replacing the Viterbi
search with a beam search. Such a learner scales well with the history
size and number of possible classes $y$.

^1 Specifically, one could build a dataset of non-sequential examples
$(x_{t,i}, y_{t,i})$ from $S$, and use it to train a classifier $g$ that
maps a single feature-vector $x$ to a label $y$. One can then use $g$ to
classify each instance $x_i$ in the vector $\mathbf{x} = (x_1, \ldots, x_n)$
separately, ignoring its sequential position, and append the resulting
predictions $y_i$ into an output vector $\mathbf{y}$.


Method   Noise   Error (%)   Min Error (%)
ME       -       3.47        3.20
MEMM     -       31.83       4.26
CRF      -       1.17        1.17
MEMM     10%     2.18        2.18
CRF      10%     1.85        1.84

Table 1: Performance of several sequential learners on the
signature-detection problem.

Unfortunately, as Table 1 shows, MEMMs perform extremely badly on the
signature-detection problem, with an error rate many times the error rate
of CRFs. In fact, on this problem, MEMMs perform much worse than the
non-sequential maximum-entropy learner ME, or even the default error
rate.^2

The MEMM’s performance is better if one is allowed to change the threshold
used to classify examples. Letting $p_i$ be the probability
$\Pr(y_i = + \mid x_i, y_{i-1})$ as computed by the local model in the
Viterbi classification of $\mathbf{x}$, we computed, for each learner, the
threshold $\theta$ such that the rule
$[(y_i = +) \Leftrightarrow (p_i > \theta)]$ gives the lowest test error
rate. The column labeled “Min Error” in Table 1 gives this result. (Of
course, since the computation of $\theta$ was done using the test data,
this is only a lower bound on attainable error rate.) The “Min Error” for
MEMMs is much lower than the error for MEMMs with the default threshold,
but still higher than either non-sequential ME or CRFs.
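The “Min Error” computation amounts to a one-dimensional search over thresholds. The sketch below is one simple way to do it, assuming the per-line probabilities $p_i$ are already available as a list; the function names and the toy numbers are hypothetical and are not taken from the experiments.

```python
def error_rate(probs, labels, theta):
    """Fraction of items misclassified by the rule: predict '+' iff p > theta."""
    wrong = sum((p > theta) != (y == "+") for p, y in zip(probs, labels))
    return wrong / len(labels)


def best_threshold(probs, labels):
    """Scan candidate thresholds and return the (theta, error) pair with lowest error.

    Only thresholds at observed probabilities (plus -inf, meaning "always +")
    can change the predictions, so these candidates are sufficient.
    """
    candidates = [float("-inf")] + sorted(set(probs))
    scored = ((t, error_rate(probs, labels, t)) for t in candidates)
    return min(scored, key=lambda pair: pair[1])


# Toy scores: the positives are ranked correctly but mostly fall below 0.5.
probs = [0.1, 0.2, 0.35, 0.45, 0.9]
labels = ["-", "-", "+", "+", "+"]
default_error = error_rate(probs, labels, 0.5)
theta, min_error = best_threshold(probs, labels)
```

As the text cautions, choosing the threshold on the test data yields only a lower bound on the attainable error rate.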

2.2  Analysis 

The literature suggests several possible explanations for these results.
For instance, Lafferty et al [8] show that MEMMs can represent only a
proper subset of the distributions that can be represented by CRFs (the
“label bias problem”). However, “label bias” does not explain why MEMMs
perform worse than non-sequential ME, since MEMMs clearly can represent a
proper superset of the distributions that ME can represent. Klein and
Manning [7] describe an “observation bias problem”, in which MEMMs give
too little weight to the history features. Error analysis on the
signature-detection task suggests that the opposite is happening here:
relative to the weights assigned by a CRF, the MEMM is actually giving too
much weight to the history features, and too little to the features from
$x_i$. The conjecture that the history features are being overweighted is
also consistent with the empirical observation that on many test email
messages, the learned MEMM makes a false positive classification somewhere
before the signature starts, and then “gets stuck” and marks every
subsequent line as part of a signature.

^2 We used the implementations of ME, MEMMs, and CRFs provided by
Minorthird [10], which uses Gaussian priors and a limited-memory
quasi-Newton method for optimization. A limit of 50 optimization
iterations was also used, although this limit does not substantially
change the results of this section.

To test this theory, we encouraged the MEMM to downweight the history
features by adding noise to the training (not test) data, as follows. For
each training email/sequence $\mathbf{x}$, we consider each feature-vector
$x_i \in \mathbf{x}$ in turn, and toss a coin with a 10% chance of landing
“heads”. If the coin flip comes up “heads”, we swap $x_i$ with some other
feature-vector $x_j$ chosen uniformly from $\mathbf{x}$. Adding this
“sequence noise” almost doubles the error rate for CRFs, but greatly
reduces the error rate for MEMMs. (Of course, this type of noise does not
affect non-sequential ME.) This experiment further supports the hypothesis
that MEMM is overweighting history features.
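A sketch of the noise procedure is given below. Two details are assumptions rather than statements from the text: each line is swapped together with its label (which keeps the set of labeled lines unchanged, in keeping with the remark that non-sequential ME is unaffected), and the swap partner is drawn from the positions other than $i$.

```python
import random


def add_sequence_noise(pairs, p=0.10, rng=None):
    """Return a copy of one training sequence with "sequence noise" added.

    pairs is a list of (x_i, y_i) tuples for a single email. Each position is
    swapped, with probability p, with another position chosen uniformly.
    """
    rng = rng or random.Random()
    noisy = list(pairs)
    n = len(noisy)
    if n < 2:
        return noisy  # nothing to swap with
    for i in range(n):
        if rng.random() < p:
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1  # skip i itself, so the partner is always a different line
            noisy[i], noisy[j] = noisy[j], noisy[i]
    return noisy


# Toy sequence: ten lines, the last two labeled as signature lines.
pairs = [(k, "+" if k >= 8 else "-") for k in range(10)]
noisy = add_sequence_noise(pairs, p=0.10, rng=random.Random(0))
```

Only the order of the lines changes, so a learner that ignores position sees exactly the same training examples, while the history features of a sequential learner become less reliable.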

3  Stacked  Sequential  Learning 

3.1  Description 

The poor results for MEMM described above can be intuitively explained as
a mismatch between the data used to train the local models of the MEMM,
and the data used to test the model. With noise-free training data, it is
always the case that a signature line is followed by more signature lines,
so it is not especially surprising that the MEMM’s local model tends to
weight this feature heavily. However, this regularity need not always hold
for the test data, which is drawn from predictions made by the local model
on different examples.

In theory, of course, this training/test mismatch is compensated for by
the Viterbi search, which is in turn driven by the confidence estimates
produced by the local model. However, if the assumptions of the theory are
violated (for instance, if there are high-order interactions not accounted
for by the maximum-entropy model), the local model’s confidence estimates
may be incorrect, leading to poor performance.

To correct the training/test mismatch, it is sufficient to modify the
extended dataset so that the true previous class $y_{i-1}$ in an extended
instance $(x_i, y_{i-1})$ is replaced by a predicted previous class
$\hat{y}_{i-1}$. Below we will outline one way to do this.

Assume that one is given a sample $S = \{(\mathbf{x}_t, \mathbf{y}_t)\}$
of size $m$, and a sequential learning algorithm $A$. Previous work on a
meta-learning method called stacking [13] suggests the following scheme
for constructing a sample of $(\mathbf{x}, \hat{\mathbf{y}})$ pairs in
which $\hat{\mathbf{y}}$ is a vector of “predicted” class-labels for
$\mathbf{x}$. First, partition $S$ into $K$ equal-sized disjoint subsets
$S_1, \ldots, S_K$, and learn $K$ functions $f_1, \ldots, f_K$, where
$f_j = A(S - S_j)$. Then, construct the set

$$\hat{S} = \{(\mathbf{x}_t, \hat{\mathbf{y}}_t) : \hat{\mathbf{y}}_t = f_j(\mathbf{x}_t) \text{ and } \mathbf{x}_t \in S_j\}$$

In other words, $\hat{S}$ pairs each $\mathbf{x}_t$ with the
$\hat{\mathbf{y}}_t$ associated with performing a $K$-fold
cross-validation on $S$. The intent of this method is that
$\hat{\mathbf{y}}$ is similar to the prediction produced by an $f$ learned
by $A$ on a size-$m$ sample that does not include $\mathbf{x}$.

Stacked Sequential Learning.

Parameters: a history size $W_h$, a future size $W_f$, and a
cross-validation parameter $K$.

Learning algorithm: Given a sample $S = \{(\mathbf{x}_t, \mathbf{y}_t)\}$,
and a sequential learning algorithm $A$:

1. Construct a sample of predictions $\hat{\mathbf{y}}_t$ for each
   $\mathbf{x}_t \in S$ as follows:

   (a) Split $S$ into $K$ equal-sized disjoint subsets $S_1, \ldots, S_K$

   (b) For $j = 1, \ldots, K$, let $f_j = A(S - S_j)$

   (c) Let $\hat{S} = \{(\mathbf{x}_t, \hat{\mathbf{y}}_t) : \hat{\mathbf{y}}_t = f_j(\mathbf{x}_t) \text{ and } \mathbf{x}_t \in S_j\}$

2. Construct an extended dataset $S'$ of instances
   $(\mathbf{x}'_t, \mathbf{y}_t)$ by converting each $\mathbf{x}_t$ to
   $\mathbf{x}'_t$ as follows: $\mathbf{x}'_t = (x'_1, \ldots, x'_n)$,
   where $x'_i = (x_i, \hat{y}_{i-W_h}, \ldots, \hat{y}_{i+W_f})$ and
   $\hat{y}_i$ is the $i$-th component of $\hat{\mathbf{y}}_t$, the label
   vector paired with $\mathbf{x}_t$ in $\hat{S}$.

3. Return two functions: $f = A(S)$ and $f' = A(S')$.

Inference algorithm: given an instance vector $\mathbf{x}$:

1. Let $\hat{\mathbf{y}} = f(\mathbf{x})$

2. Carry out Step 2 above to produce an extended instance $\mathbf{x}'$
   (using $\hat{\mathbf{y}}$ in place of $\hat{\mathbf{y}}_t$).

3. Return $f'(\mathbf{x}')$.

Table 2: The sequential stacking meta-learning algorithm.

This procedure is the basis of the meta-learning algorithm of Table 2.
This method begins with a sample $S$ and a sequential learning method $A$.
In the discussion below we will assume that $A$ is ME, used for sequential
data.

Using $S$, $A$, and cross-validation techniques, one first pairs with each
$\mathbf{x}_t \in S$ the vector $\hat{\mathbf{y}}_t$ associated with
performing cross-validation with ME. These predictions are then used to
create a dataset $S'$ of extended instances $\mathbf{x}'$, which in the
simplest case are simply vectors composed of instances of the form
$(x_i, \hat{y}_{i-1})$, where $\hat{y}_{i-1}$ is the $(i-1)$-th label in
$\hat{\mathbf{y}}$.

The extended examples $S'$ are then used to train a model $f' = A(S')$. If
$A$ is the non-sequential maximum-entropy learner, this step is similar to
the process of building a “local model” for an MEMM: the difference is
that the history features added to $x_i$ are derived not from the true
history of $x_i$, but are (approximations of) the off-sample predictions
of an ME classifier.

Figure 1: Stacked sequential learning, varying history size ($W_h$) and
window size ($W = W_h = W_f$). The base learning algorithm $A$ is
maximum-entropy (ME), unless otherwise stated.

At inference time, $f'$ must be run on examples that have been extended by
adding prediction features $\hat{\mathbf{y}}$. To keep the “test”
distribution similar to the “training” distribution, $f'$ will not be used
as the inner loop of a Viterbi or beam-search process; instead, the
predictions $\hat{\mathbf{y}}$ are produced using a non-sequential
maximum-entropy model $f$ that is learned from $S$. The algorithm of
Table 2 simply generalizes this idea from ME to an arbitrary sequential
learner, and from a specific history feature to a parameterized set of
features.
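The learning and inference procedures of Table 2 can be sketched in a few dozen lines. The code below is an illustration under stated assumptions, not the implementation used in the experiments: the base learner is a toy nearest-centroid classifier standing in for ME, labels are 0/1, predictions are hard labels rather than confidences, folds are formed by striding over the sample, and window positions that fall outside a sequence are padded with 0.

```python
def centroid_learner(sample):
    """Toy base learner A (a stand-in for ME): nearest class centroid per item."""
    sums, counts = {}, {}
    for x_seq, y_seq in sample:
        for x, y in zip(x_seq, y_seq):
            acc = sums.setdefault(y, [0.0] * len(x))
            for d, v in enumerate(x):
                acc[d] += v
            counts[y] = counts.get(y, 0) + 1
    centroids = {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

    def f(x_seq):
        def nearest(x):
            return min(centroids,
                       key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))
        return [nearest(x) for x in x_seq]
    return f


def extend(x_seq, y_hat, w_h, w_f, pad=0):
    """Step 2: x'_i = (x_i, yhat_{i-w_h}, ..., yhat_{i+w_f}); `pad` is an assumption."""
    n = len(x_seq)
    out = []
    for i, x in enumerate(x_seq):
        window = [y_hat[j] if 0 <= j < n else pad for j in range(i - w_h, i + w_f + 1)]
        out.append(tuple(x) + tuple(window))
    return out


def stacked_train(sample, learner, k=5, w_h=1, w_f=0):
    """Learning algorithm: returns the pair (f, f')."""
    folds = [sample[j::k] for j in range(k)]           # Step 1(a)
    s_hat = []
    for j, held_out in enumerate(folds):
        rest = [ex for m, fold in enumerate(folds) if m != j for ex in fold]
        f_j = learner(rest)                            # Step 1(b): f_j = A(S - S_j)
        s_hat.extend((x, y, f_j(x)) for x, y in held_out)  # Step 1(c)
    s_ext = [(extend(x, y_hat, w_h, w_f), y) for x, y, y_hat in s_hat]  # Step 2
    return learner(sample), learner(s_ext)             # Step 3


def stacked_predict(f, f_prime, x_seq, w_h=1, w_f=0):
    """Inference algorithm: predict with f, extend, then predict with f'."""
    return f_prime(extend(x_seq, f(x_seq), w_h, w_f))


# Toy data: ten copies of one sequence with a long run of 0s followed by 1s.
seq = ([(0.1,), (0.2,), (0.0,), (0.9,), (0.8,), (1.0,)], [0, 0, 0, 1, 1, 1])
sample = [seq] * 10
f, f_prime = stacked_train(sample, centroid_learner, k=5, w_h=1, w_f=1)
pred = stacked_predict(f, f_prime, seq[0], w_h=1, w_f=1)
```

The point of the structure is that `f_prime` is trained on predictions made for sequences that the predicting model never saw, which mirrors what it receives at inference time.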

In our experiments, we introduced one small but important refinement: each
“history feature” $\hat{y}$ added to an extended example is not simply a
predicted class, but a numeric value indicating the log-odds of that
class. This makes accessible to $f'$ the confidences previously used by
the Viterbi search.
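For a binary task, the log-odds of the positive class can be computed from a predicted probability as sketched below. The clipping constant `eps` is an assumption added to keep the value finite when a model outputs exactly 0 or 1; the text does not say how such cases were handled.

```python
import math


def log_odds(p, eps=1e-6):
    """Log-odds log(p / (1 - p)) of a predicted probability p.

    p is clipped to [eps, 1 - eps] (an assumption) so the result is finite.
    """
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


confident_positive = log_odds(0.9)
undecided = log_odds(0.5)
confident_negative = log_odds(0.1)
```

A value near zero signals an uncertain prediction, while large positive or negative values signal confident ones, so a downstream model can weight them accordingly.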

3.2  Initial  results 

We applied stacked sequential learning with ME as the base learner
(henceforth s-ME) to the signature-detection dataset. We used $K = 5$,
$W_h = 1$, and $W_f = 0$. (Notice that with these parameters the extended
instance constructed from $x_i$ includes $\hat{y}_i$ as well as
$\hat{y}_{i-1}$.) The s-ME method obtains an error rate of 2.63% on the
signature-detection task. This is less than the baseline ME method (3.20%)
but still higher than CRFs (1.17%). However, three extensions to s-ME are
straightforward to implement, and dramatically improve performance.

More past labels. Like MEMMs, s-ME can efficiently handle a large
“history” of previous predicted classes. In fact, s-ME can handle large
histories more easily than MEMMs, as it does not need to resort to beam
search for inference: the only impact of more history features is to add
new features to the extended instances. On the signature-detection task,
increasing the history size reduces error to 2.38% (with a history size of
11) as is shown in Figure 1.

Past and future labels. Unlike in MEMMs, the extended instance for $x_i$
can include predicted classes not only of previous instances, but also of
“future” instances, that is, instances that follow $x_i$ in the sequence
$\mathbf{x}$. We explored different “window sizes” for s-ME, where a
“window size” of $W$ means that $W_h = W_f = W$, i.e., the $W$ previous
and $W$ following predicted labels are added to each extended instance.
This reduces error rates substantially, to only 0.71%. This is a 46%
reduction from CRF’s error rate of 1.17%. The improvement is also
statistically significant.^3

Used in this way, s-ME is a sort of bidirectional model, broadly similar
to the model proposed by Toutanova et al for part of speech tagging [12].
We note that here, as in Toutanova’s results, it is more valuable to use
information about both the previous and future labels than to consider
only previous labels.

Different base learners. Stacked sequential learning can be applied to any
learner; in particular, since the extended examples are sequential, it can
be applied to any sequential learner. We evaluated stacked sequential CRFs
(henceforth s-CRFs) with varying window sizes on this problem. As shown in
Figure 1, s-CRFs also outperform CRFs, and again, the difference is both
substantial and statistically significant. However, with large window
sizes, there is little difference in performance between s-CRF and s-ME.

3.3  Discussion 

A graphical view of an MEMM is shown in Part (a) of Figure 2. We use the
usual convention in which nodes for known values are shaded. Each node is
associated with a maximum-entropy conditional model which defines a
probability distribution given its input values.
Part (b) of the figure presents a similar graphical view of the classifier
learned by sequential stacking. (The figure shows sequential stacking for
the default setting of $W_h = 1$ and $W_f = 0$.) Inference in this model
is done in two stages: first the middle layer is inferred from the bottom
layer, then the top layer is inferred from the middle layer. The nodes in
the middle layer are partly shaded to indicate their hybrid status: they
are considered outputs by the model $f$, and inputs by the model $f'$.

^3 Specifically, a two-tailed paired t-test rejects with > 95% confidence
the null hypothesis that the difference in error rate between s-ME and CRF
on a randomly selected sequence $\mathbf{x}$ has a mean of zero.

(a) Maximum-entropy Markov model (MEMM)
(b) Sequential stacking
(c) Sequential stacking with width W = 2
(d) Two-level sequential stacking

Figure 2: Graphical views of alternative sequential-stacking schemes.

One way to interpret the hybrid layer is as a means of making the
inference more robust. If the middle-layer nodes were treated as ordinary
unobserved variables, the top-layer conditional model ($f'$) would rely
heavily on the confidence assessments of the lower-layer model ($f$).
Forcing $f'$ to treat these variables as observed quantities allows $f'$
to develop its own model of how the $\hat{y}$ predictions made by $f$
correlate with the actual outputs $y$. This allows $f'$ to accept or
downweight $f$’s predictions, as appropriate. As suggested by the dotted
line in the figure, stacking conceptually creates a “firewall” between $f$
and $f'$, insulating $f'$ from possible errors in confidence made by $f$.

Part (c) of the figure shows a sequential stacking model with a window of
$W_h = W_f = 2$. To simplify the figure, only the edges that eventually
lead to the node $Y_i$ are shown.

Part (d) of the figure shows another plausible extension of sequential
stacking, in which each $\hat{y}$ is replaced with a better approximation
of $y$: namely, the output of sequential stacking itself. (Again to
simplify the figure, a minimal set of arcs is shown, in this case for
stacking with $W_h = 1$ and $W_f = -1$.) This “deeper” stacking scheme can
be implemented quite easily, for instance by applying the sequential
stacking scheme to the base learner s-ME. However, our initial experiments
were discouraging: for instance, the depth-two learner s-(s-ME) has a
slightly higher error rate (3.04%) than s-ME. The limited amount of
training data available for the lowest-level models may be an issue: with
$K = 5$, for instance, only 64% of the total data is available on the
lowest-level cross-validation runs.
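The 64% figure follows from nesting two levels of $K$-fold cross-validation, each of which withholds one fold in $K$ from the model being trained:

$$\left(\frac{K-1}{K}\right)^{2} = \left(\frac{4}{5}\right)^{2} = \frac{16}{25} = 0.64 \quad \text{for } K = 5.$$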

To conclude our discussion, we note that as described, sequential stacking
increases run-time of the base learning method by approximately a constant
factor of $K + 2$. (To see this, note that sequential stacking requires
training $K + 2$ classifiers: the classifiers $f_1, \ldots, f_K$ used in
cross-validation, and the final classifiers $f$ and $f'$.) When data is
plentiful but training time is limited, it is also possible to simply
split the original dataset $S$ into two disjoint halves $S_1$ and $S_2$,
and train two classifiers $f$ and $f'$ from $S_1$ and $S'_2$ respectively
(where $S'_2$ is $S_2$, extended with the predictions produced by $f$).
This scheme leaves training time approximately unchanged for a linear-time
base learner, and decreases training time for any base learner that
requires super-linear time.
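A rough cost model makes both claims concrete. Assume, purely for illustration, that the base learner needs time $T(m) = c\,m^{\alpha}$ on $m$ examples with $\alpha \ge 1$, and ignore the small extra cost of the added prediction features. Full sequential stacking then costs

$$T_{\text{stack}}(m) = K\,T\!\left(\tfrac{K-1}{K}\,m\right) + 2\,T(m) \;\le\; (K+2)\,T(m),$$

which for $K = 5$ is at most $7\,T(m)$. The split-half scheme costs

$$T_{\text{split}}(m) = 2\,T\!\left(\tfrac{m}{2}\right) = 2^{\,1-\alpha}\,T(m),$$

which equals $T(m)$ when $\alpha = 1$ and is strictly smaller when $\alpha > 1$ (for example, $T(m)/2$ for a quadratic-time learner).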

4  Experimental  Results 

4.1  Additional  Problems 

We also evaluated non-sequential ME, MEMMs, CRFs, s-ME, and s-CRFs on
several other sequential partitioning tasks. For stacking, we used $K = 5$
and a window size of $W_h = W_f = 5$ on all problems. These were the only
parameter values explored in this section, and no changes were made to the
sequential stacking algorithm, which was developed based on observations
made from the signature-detection task only.

One set of tasks involved classifying lines from FAQ documents with labels
like “header”, “question”, “answer”, and “trailer”. We used the features
adopted by McCallum et al [9] and the three tasks (ai-general,
ai-neural-nets, and aix) adopted by Dietterich et al [5]. The data
consists of 5-7 long sequences, each sequence corresponding to a single
FAQ document; in total, each task contains between 8,965 and 12,757
labeled lines. Our current implementation of sequential stacking only
supports binary labels, so we considered the two labels “trailer” (T) and
“answer” (A) as separate tasks for each FAQ, leading to a total of six new
benchmarks.

Figure 3: Comparison of the error rates for s-ME with the error rates of
ME, MEMM, and CRFs.

Another set of tasks were video segmentation tasks, in which the goal is
to take a sequence of video “shots” (a sequence of adjacent frames taken
from one camera) and classify them into categories such as “anchor”,
“news” and “weather”. This dataset contains 12 sequences, each
corresponding to a single video clip. There are a total of 418 shots, and
about 700 features, which are produced by applying LDA to a 5x5, 125-bin
RGB color histogram of the central frame of the shot. (This data was
provided by Yik-Cheung Tam and Ming-yu Chen.) We constructed two separate
video partitioning tasks, corresponding to the two most common labels.

All eight of these additional tasks are similar to the signature-detection
task in that they contain long runs of identical labels, leading to strong
regularities in constructed history features. Error rates for the learning
methods on these eight tasks, in addition to the previous
signature-detection task, are shown in Table 3. In each case a single
train/test split was used to evaluate error rates. The bold-faced entries
are the lowest error rate on a row.

We observe that MEMMs suffer extremely high error rates on two of the new
tasks (finding “answer” lines for ai-general and ai-neural-nets),
suggesting that the “anomalous” behavior shown in signature-detection may
not be uncommon, at least in sequential partitioning tasks.

Also, comparing s-ME to ME, we see that s-ME improves the error rate in 8
of 9 tasks, and leaves it unchanged once. Furthermore, s-ME has a lower
error rate than CRFs 7 of 9 times, and has the same error rate once. There
is only one case in which MEMMs have a lower error rate than s-ME.

Table 4: Comparison of different sequential algorithms on a set of nine
benchmark tasks.

Overall, s-ME seems to be preferable to any of the three older approaches
(ME, MEMMs, and CRFs). This is made somewhat more apparent by the scatter
plot of Figure 3. On this plot, each point is placed so the y-axis
position is the error of s-ME, and the x-axis position is the error of an
earlier learner; thus points below the line $y = x$ are cases where s-ME
outperforms another learner. (For readability, the range of the x-axis is
truncated: it does not include the highest error rates of MEMM.)

Stacking  also  improves  CRF  on  some  problems,  but 
the  effect  is  not  as  consistent:  s-CRF  improves  the 
error  rate  on  5  of  9  tasks,  leaves  it  unchanged  twice, 
and  increases  the  error  rate  twice.  In  the  table,  one  of 
the  two  stacked  learners  has  the  lowest  error  rate  on  8 
of  the  9  tasks. 


Figure 4: Comparison of the error rates of various algorithms with and
without sequential stacking.


Method A   W-L-T (s-A vs. A)   Null hypothesis       Confidence
ME         8-0-1               E[Δ(A)] > 0           > 0.98
VP
CRF

Table 5: Comparison of stacked vs unstacked learners, using a one-tailed
sign test on error rates obtained on the nine benchmark problems. Here
Δ(A) = error(s-A) - error(A), i.e., the difference in errors between s-A
and A on a randomly selected task.

4.2 Additional Base Learners

We conducted the same experiments with two
margin-based base learners: the non-sequential voted
perceptron algorithm (VP) [6] and a voted-perceptron based
training scheme for HMMs proposed by Collins
(VPHMMs) [3]. Table 4 shows the results for these
methods, and their sequentially-stacked versions. Both the
sequential and non-sequential voted perceptrons were
run for 20 epochs.

In this case, s-VP outperforms or ties both VP and
VPHMM on all nine problems. The s-VPHMM has a
lower error rate than the VPHMM 4 times, a higher
error rate once, and the same error rate 4 times.

There does not seem to be any clear pattern in the
relative performance between s-ME and s-VP: neither
method consistently outperforms the other. Nor does
any clear pattern appear in the relative performance of
s-CRF and s-VPHMM. This is not unexpected, since
in practice, it is often the case that probabilistic
methods work best on some problems, and margin-based
methods work best on others.


4.3  Overview  of  results 

An overview of the improvements obtained by
sequential stacking on these problems is shown in Figure 4
and Table 5. The scatter plot shows the error rate
of s-ME plotted against the error rate of ME, the
error rate of s-VP plotted against VP, and similarly for
s-VPHMM vs VPHMM and s-CRF vs CRF.

The plot shows a plausible pattern: sequential stacking
nearly always improves the performance of the
non-sequential learners (ME and VP) but improves the
performance of the sequential learners (CRFs and
VPHMMs) less consistently. This pattern is confirmed by
a series of one-tailed sign tests performed on pairs of
learners, which are summarized in Table 5.
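
As a hedged illustration of how a W-L-T record translates into a confidence level, the sketch below computes a one-tailed sign test under the null hypothesis that a win and a loss are equally likely on each task. How ties were treated is not stated in this section, so the `ties_count_against` flag is an assumption that exposes two common conventions; the function name is also illustrative.

```python
from math import comb


def sign_test_confidence(wins: int, losses: int, ties: int = 0,
                         ties_count_against: bool = False) -> float:
    """Return 1 - p for a one-tailed sign test.

    p is the probability of observing at least `wins` wins in n fair
    coin flips. By default ties are dropped (n = wins + losses); with
    ties_count_against=True each tie is counted as a non-win.
    """
    n = wins + losses + (ties if ties_count_against else 0)
    # Upper tail of Binomial(n, 1/2): P(X >= wins).
    p = sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n
    return 1.0 - p


# The 8-0-1 record reported for s-ME vs ME:
print(sign_test_confidence(8, 0, 1))                           # ties dropped
print(sign_test_confidence(8, 0, 1, ties_count_against=True))  # tie as non-win
```

For the 8-0-1 record, dropping the tie gives 1 - 1/256 (about 0.996), and counting it as a non-win gives 1 - 10/512 (about 0.980); both are consistent with the "> 0.98" entry for ME in Table 5.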

The sign test does not consider the amounts by which
error rates are changed. From the figures and tables,
it is clear that when error rates are lowered, they are
often lowered substantially. However, even for CRFs,
the error rate is only once raised by more than a very
small proportion (for the “A/aix” benchmark).

5  Conclusions 

Sequential partitioning tasks are sequential
classification tasks characterized by long runs of identical
labels: examples of these tasks include document
analysis, video segmentation, and gene finding. In this
paper, we have evaluated the performance of certain
well-studied sequential probabilistic learners on sequential
partitioning tasks. It was observed that MEMMs
sometimes obtain extremely high error rates. Error
analysis suggests that this problem is due neither to
“label bias” [8] nor to “observation bias” [7], but to a
mismatch between the data used to train the MEMM’s
local model and the data on which the MEMM’s
local model is tested. In particular, since MEMMs are
trained on “true” labels and tested on “predicted”
labels, the strong correlations between adjacent labels
associated with sequential partitioning tasks can be
misleading to the MEMM’s learning method.

Motivated by these issues, we derived a novel method
in which cross-validation is used to correct this
mismatch. The end result is a meta-learning scheme
called stacked sequential learning. Sequential
stacking is simple to implement, can be applied to
virtually any base learner, and imposes a constant
overhead in learning time (the constant being the number
of cross-validation folds plus two). In experiments on
several partitioning tasks, sequential stacking
consistently improves the performance of two non-sequential
base learners, often dramatically. On our set of
benchmark problems, sequential stacking with a
maximum-entropy learner as the base learner outperforms CRFs
7 of 9 times, and ties once. Perhaps more surprisingly,
sequential stacking also often improves performance
of learners specifically designed for sequential tasks,
such as conditional random fields and discriminatively
trained HMMs.
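
The scheme summarized above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the nearest-centroid `centroid_learner` is a toy stand-in for the base learner, folds are assigned by sequence index, the window includes the current position, and the padding value 0.5 for positions outside a sequence is arbitrary. All function names are hypothetical.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Seq = Tuple[List[Vector], List[int]]          # (instances, labels) of one sequence
Predictor = Callable[[Vector], int]
Learner = Callable[[List[Vector], List[int]], Predictor]


def centroid_learner(X: List[Vector], y: List[int]) -> Predictor:
    """Toy base learner (an assumption): predict the nearest class centroid."""
    sums: dict = {}
    counts: dict = {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def predict(x: Vector) -> int:
        return min(centroids, key=lambda c: sum(
            (a - b) ** 2 for a, b in zip(x, centroids[c])))

    return predict


def extend(xs: List[Vector], yhat: List[int], window: int,
           pad: float = 0.5) -> List[Vector]:
    """Append predicted labels at positions t-window..t+window to each x_t."""
    n = len(xs)
    return [list(xs[t]) + [float(yhat[t + d]) if 0 <= t + d < n else pad
                           for d in range(-window, window + 1)]
            for t in range(n)]


def flatten(seqs: List[Seq]) -> Tuple[List[Vector], List[int]]:
    return ([x for xs, _ in seqs for x in xs],
            [label for _, ys in seqs for label in ys])


def stacked_train(seqs: List[Seq], learner: Learner,
                  k: int = 5, window: int = 1):
    """Train a stacked model; requires k >= 2 and at least k sequences."""
    # K trainings: label each sequence with a model that never saw it,
    # so training-time neighbour labels are predictions, as at test time.
    yhats: List[List[int]] = [[] for _ in seqs]
    for j in range(k):
        f_j = learner(*flatten([s for i, s in enumerate(seqs) if i % k != j]))
        for i, (xs, _) in enumerate(seqs):
            if i % k == j:
                yhats[i] = [f_j(x) for x in xs]
    # One training on all data: supplies predicted labels at test time.
    f = learner(*flatten(seqs))
    # One training on instances extended with the cross-validated labels.
    X2 = [x for (xs, _), yh in zip(seqs, yhats) for x in extend(xs, yh, window)]
    f2 = learner(X2, flatten(seqs)[1])

    def predict(xs: List[Vector]) -> List[int]:
        yhat = [f(x) for x in xs]
        return [f2(x) for x in extend(xs, yhat, window)]

    return predict
```

With `k` folds the base learner is invoked `k + 2` times in this sketch (once per fold, once on all the data, and once on the extended data), which matches the constant overhead of folds plus two stated above.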

Some  initial  experiments  on  a  named  entity  recogni¬ 
tion  problem  suggest  that  sequential  stacking  does  not 
improve  performance  on  non-partitioning  problems; 
however,  in  future  work,  we  plan  to  explore  this  issue 
with  more  detailed  experimentation.  We  also  plan  to 
extend  our  implementation  to  handle  non-binary  label 
sets. 